Abstract

Standard measures of batting performance such as a batting average
and an on-base percentage can be decomposed into component rates such
as strikeout rates and home run rates. The likelihood of hitting data
for a group of players can be expressed as a product of likelihoods of
the component probabilities, and this motivates the use of random
effects models to estimate the groups of component rates. This
methodology leads to accurate estimates of hitting probabilities and
good predictions of performance for following seasons. This approach
is also illustrated for on-base probabilities and FIP abilities of
pitchers.

1 Introduction


Efron and Morris (1975) demonstrated the benefit of simultaneous
estimation with a simple example: using the batting outcomes of 18
players in their first 45 at-bats of the 1970 season to predict their
batting averages for the remainder of the season. Essentially, improved
batting estimates shrink the observed averages towards the average
batting average of all players. One common way of achieving this
shrinkage is by means of a random effects model where the players’
underlying probabilities are assumed to come from a common distribution,
and the parameters of this “random effects” distribution are assigned
vague prior distributions.

In modern sabermetrics research, a batting average is not perceived to 
be a valuable measure of batting performance. One issue is that the batting 
average assigns each possible hit the same value, and it does not incorporate 
in-play events such as walks that are beneficial for the general goal of scoring 
runs. Another concern is that the batting average is a convoluted measure 
that combines different batting abilities such as not striking out, hitting a 
home run, and getting a hit on a ball placed in-play. Indeed, it is difficult 
to really say what it means for a batter to “hit for average”. Similarly, 
an on-base percentage does not directly communicate a batter’s different 
performances in drawing walks or getting base hits. 

A deeper concern about a batting average is that chance plays a large role
in the variability of player batting averages, or the variability of a player’s
batting average over seasons. Albert (2004) uses a beta-binomial random
effects model to demonstrate this point. If a group of players have 500
at-bats, then approximately half the variability in the players’ batting
averages is due to chance (binomial) variation, and the remaining half is due
to variability in the players’ underlying hitting probabilities. In contrast,
other batting rates are less affected by chance. For example, only a small
percentage of the variability in players’ observed home run rates is due to
chance; much of the variability is due to the differences in the batters’
home run abilities.

The role of chance has received recent attention with the development of
FIP (fielding independent pitching) measures. McCracken (2001) made the
surprising observation that pitchers had little control of the outcomes of
balls that were put in-play. One conclusion from this observation is that the
BABIP, batting average on balls put in-play, is largely influenced by luck
or binomial variation, and the FIP measure is based on outcomes such as
strikeouts, walks, and home runs that are largely under the pitcher’s control.

Following Bickel (2004), Albert (2004) illustrated the decomposition of a
batting average into different components and discussed the luck/skill aspect
of different batting rates. In this paper, similar decompositions are used to
develop more accurate predictions of a collection of batting averages.
Essentially, the main idea is to first represent a hitting probability in
terms of component probabilities, estimate groups of component probabilities
by means of random effects models, and use these component probability
estimates to obtain accurate estimates of the hitting probabilities. Sections
3, 4, and 5 illustrate the general ideas for the problem of simultaneously
estimating a collection of “batting average” probabilities, and Section 6
demonstrates the usefulness of this scheme in predicting batting averages for
a following season. Section 7 illustrates this plan for estimating on-base
probabilities. Section 8 gives a historical perspective on how the different
component hitting rates have changed over time, and Section 9 illustrates the
use of this perspective in understanding the career trajectories of hitters
and pitchers. The FIP measure is shown in Section 10 as a function of
particular hitting rates and this representation is used to develop useful
estimates of pitcher FIP abilities. Section 11 concludes by describing several
modeling extensions of this approach.

2 Related Literature 

Since Efron and Morris (1975), a body of work has developed on finding
improved measures of performance in baseball. Tango et al (2007) discuss the
general idea of estimating a player’s true talent level by adjusting his past
performance towards the performance of a group of similar players, and the
appendix gives the familiar normal likelihood/normal prior algorithm for
performing this adjustment. Brown (2008), McShane et al (2011), Neal et al
(2010), and Null (2009) propose different “shrinking-type” methods for
estimating batting abilities for Major League batters. Similar types of
methods are proposed by Albert (2006), Piette and James (2012), and Piette et
al (2010) for estimating pitching and fielding metrics. Albert (2002) and
Piette et al (2012) focus on the problem of simultaneously estimating player
hitting and fielding trajectories.

Albert (2004), Bickel (2004) and Bickel and Stotz (2003) describe
decompositions of batting averages. Baumer (2008) performs a similar
decomposition of batting average (BA) and on-base percentage (OBP) with the
intent of showing mathematically that BA is more sensitive than OBP to
the batting average on balls in-play.


3 Decomposition of a Batting Average

The basic decomposition of a batting average is illustrated in Figure 1.
Suppose one divides all at-bats into strikeouts (SO) and not-strikeouts (Not
SO). Of the AB that are not strikeouts, we divide into the home runs (HR) and
the balls that are put “in-play”. Finally, we divide the balls in-play into
the in-play hits (HIP) and the in-play outs (OIP).



Figure 1: Breakdown of an at-bat. 

This representation leads to a decomposition of the batting average H/AB.
We first write the proportion of hits as the proportion of AB that are not
strikeouts times the proportion of hits among the non-strikeouts.

\[ \frac{H}{AB} = \left(1 - \frac{SO}{AB}\right) \times \frac{H}{AB - SO}. \]

Continuing, if we break down these AB − SO opportunities by HR, then we
write the hit proportion as the proportion of non-strikeouts that are home
runs plus the proportion of non-strikeouts that are singles, doubles, or triples
(H − HR).

\[ \frac{H}{AB - SO} = \frac{HR}{AB - SO} + \frac{H - HR}{AB - SO}. \]

Finally, we write the proportion of non-strikeouts that are singles, doubles,
or triples as the proportion of non-strikeouts that are not home runs times
the proportion of balls in-play (AB − SO − HR) that are hits.

\[ \frac{H - HR}{AB - SO} = \left(1 - \frac{HR}{AB - SO}\right) \times \frac{H - HR}{AB - SO - HR}. \]

Putting it all together, we have the following representation of a batting
average BA = H/AB:

\[ BA = (1 - \mathrm{SO.Rate}) \times [\mathrm{HR.Rate} + (1 - \mathrm{HR.Rate}) \times \mathrm{BABIP}], \]

where the relevant rates are:

the strikeout rate

\[ \mathrm{SO.Rate} = \frac{SO}{AB}, \]

the home run rate

\[ \mathrm{HR.Rate} = \frac{HR}{AB - SO}, \]

the batting average on balls in play rate

\[ \mathrm{BABIP} = \frac{H - HR}{AB - SO - HR}. \]

Instead of simply recording hits and outs, we are regarding the outcomes of
an at-bat as multinomial data with the four outcomes SO, HR, HIP, and
OIP.
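
The decomposition can be checked numerically. The following is a minimal
sketch in Python; the function names and the season counts are hypothetical
and chosen only for illustration.

```python
def component_rates(ab, so, hr, h):
    """Return (SO.Rate, HR.Rate, BABIP) computed from season counts."""
    so_rate = so / ab                    # strikeouts among at-bats
    hr_rate = hr / (ab - so)             # home runs among non-strikeouts
    babip = (h - hr) / (ab - so - hr)    # in-play hits among balls in play
    return so_rate, hr_rate, babip


def batting_average(so_rate, hr_rate, babip):
    """Recombine the three component rates into a batting average."""
    return (1 - so_rate) * (hr_rate + (1 - hr_rate) * babip)


# Hypothetical season line: 500 AB, 100 SO, 20 HR, 140 H.
rates = component_rates(ab=500, so=100, hr=20, h=140)
ba = batting_average(*rates)
print(rates, ba)
```

Up to floating-point rounding, the recombined value equals H/AB, since the
representation is an algebraic identity rather than an approximation.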

4 Multinomial Data and Likelihood 

The decomposition of a batting average leads to a multinomial sampling 
model for the hitting outcomes, and the multinomial sampling leads to an 
attractive representation of the likelihood of the underlying probabilities. 

There are four outcomes of an at-bat: SO, HR, HIP, and OIP (strikeout,
home run, hit-in-play, and out-in-play). Let $p_{SO}$ denote the probability
that an at-bat results in a strikeout, let $p_{HR}$ denote the probability a
non-strikeout results in a home run, and let $p_{HIP}$ denote the probability
that a ball-in-play (not SO or HR) results in an in-play hit. If this player
has n at-bats, the vector of counts of SO, HR, HIP, and OIP is multinomial
with corresponding probabilities $p_{SO}$, $(1 - p_{SO}) p_{HR}$,
$(1 - p_{SO})(1 - p_{HR}) p_{HIP}$, and $(1 - p_{SO})(1 - p_{HR})(1 - p_{HIP})$.
These expressions are analogous to the breakdowns of the hitting rates. For
example, the probability a hitter gets a home run in an at-bat is equal to the
probability that the person does not strike out, $(1 - p_{SO})$, times the
probability that the hitter gets a home run among all the non-strikeouts,
$p_{HR}$. Likewise, the probability a batter gets a hit in-play is the
probability he does not strike out times the probability he does not get a
home run in a non-strikeout times the probability he gets a hit in a ball put
into play.

Denote the multinomial counts for a particular player as
$(y_{SO}, y_{HR}, y_{HIP}, n - y_{SO} - y_{HR} - y_{HIP})$, where n is the
total number of at-bats. The likelihood of the associated probabilities is
given by

\[ L = p_{SO}^{y_{SO}} \times ((1 - p_{SO}) p_{HR})^{y_{HR}} \times ((1 - p_{SO})(1 - p_{HR}) p_{HIP})^{y_{HIP}} \times ((1 - p_{SO})(1 - p_{HR})(1 - p_{HIP}))^{n - y_{SO} - y_{HR} - y_{HIP}}. \]

With some rearrangement of terms, one can show that the likelihood has a
convenient factorization:

\[ L = \left[ p_{SO}^{y_{SO}} (1 - p_{SO})^{n - y_{SO}} \right] \times \left[ p_{HR}^{y_{HR}} (1 - p_{HR})^{n - y_{SO} - y_{HR}} \right] \times \left[ p_{HIP}^{y_{HIP}} (1 - p_{HIP})^{n - y_{SO} - y_{HR} - y_{HIP}} \right] = L_1 \times L_2 \times L_3. \]

From the above representation, we see

• $L_1$ is the likelihood for a binomial($n$, $p_{SO}$) distribution

• $L_2$ is the likelihood for a binomial($n - y_{SO}$, $p_{HR}$) distribution

• $L_3$ is the likelihood for a binomial($n - y_{SO} - y_{HR}$, $p_{HIP}$) distribution
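
The factorization can be verified numerically for any choice of probabilities
and counts. The sketch below is illustrative only: the function names are
hypothetical, the counts are made up, and the multinomial and binomial
coefficients are dropped since they do not involve the probabilities.

```python
import math


def multinomial_kernel(p_so, p_hr, p_hip, y_so, y_hr, y_hip, n):
    """Multinomial likelihood of the four outcomes, without the coefficient."""
    y_oip = n - y_so - y_hr - y_hip
    return (p_so ** y_so
            * ((1 - p_so) * p_hr) ** y_hr
            * ((1 - p_so) * (1 - p_hr) * p_hip) ** y_hip
            * ((1 - p_so) * (1 - p_hr) * (1 - p_hip)) ** y_oip)


def binomial_kernel(p, y, n):
    """Binomial likelihood of y successes in n trials, without the coefficient."""
    return p ** y * (1 - p) ** (n - y)


def factored_kernel(p_so, p_hr, p_hip, y_so, y_hr, y_hip, n):
    """Product L1 x L2 x L3 of the three binomial likelihoods."""
    l1 = binomial_kernel(p_so, y_so, n)
    l2 = binomial_kernel(p_hr, y_hr, n - y_so)
    l3 = binomial_kernel(p_hip, y_hip, n - y_so - y_hr)
    return l1 * l2 * l3


# Hypothetical player: 50 AB with 10 SO, 2 HR and 12 in-play hits.
args = (0.2, 0.05, 0.3, 10, 2, 12, 50)
full = multinomial_kernel(*args)
factored = factored_kernel(*args)
print(math.isclose(full, factored, rel_tol=1e-9))
```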

Above we consider the multinomial likelihood for a single player, when
in reality we have N hitters with unique hitting probabilities. For the jth
player, we have associated probabilities $p^j_{SO}, p^j_{HR}, p^j_{HIP}$.
Following the same factorization, it is straightforward to show that the
likelihood of the vectors of probabilities $\mathbf{p}_{SO} = \{p^j_{SO}\}$,
$\mathbf{p}_{HR} = \{p^j_{HR}\}$, and $\mathbf{p}_{HIP} = \{p^j_{HIP}\}$ is
given by

\[ L(\mathbf{p}_{SO}, \mathbf{p}_{HR}, \mathbf{p}_{HIP}) = \prod_{j=1}^{N} \left( L_1^j L_2^j L_3^j \right) = \prod_{j=1}^{N} L_1^j \times \prod_{j=1}^{N} L_2^j \times \prod_{j=1}^{N} L_3^j. \]

5 Exchangeable Modeling


5.1 The Prior 


The factorization of the likelihood motivates an attractive way of
simultaneously estimating the multinomial probabilities for all players.
Suppose that one’s prior beliefs about the vectors $\mathbf{p}_{SO}$,
$\mathbf{p}_{HR}$, and $\mathbf{p}_{HIP}$ are independent, and we represent
each set of probabilities by an exchangeable model represented by a
multilevel prior structure.

In particular, suppose that the strikeout probabilities are believed to be
exchangeable. One way of representing this belief is by the following mixture
of betas model.


• $p^1_{SO}, ..., p^N_{SO}$ are independent Beta($K_{SO}, \eta_{SO}$), where a
Beta($K, \eta$) density has the form

\[ g(p) = \frac{1}{B(K\eta, K(1 - \eta))} \, p^{K\eta - 1} (1 - p)^{K(1 - \eta) - 1}, \quad 0 < p < 1, \]

and $B(a, b)$ is the beta function. The parameter $\eta$ is the prior mean
and $K$ is a “precision” parameter in the sense that the prior variance
$\eta(1 - \eta)/(K + 1)$ is a decreasing function of $K$.


• The beta parameters ($K_{SO}, \eta_{SO}$) are assigned the vague prior

\[ g(K_{SO}, \eta_{SO}) \propto \frac{1}{\eta_{SO}(1 - \eta_{SO})(1 + K_{SO})^2}, \quad K_{SO} > 0, \; 0 < \eta_{SO} < 1. \]

Similarly, we represent a belief in exchangeability of the home run
probabilities in $\mathbf{p}_{HR}$ by assigning a similar two-stage prior with
unknown hyperparameters $K_{HR}$ and $\eta_{HR}$. Likewise, an exchangeable
prior on the hit-in-play probabilities $\mathbf{p}_{HIP}$ is assigned with
hyperparameters $K_{HIP}$ and $\eta_{HIP}$.

5.2 The Posterior


We saw that the likelihood function factors into independent components
corresponding to the SO, HR, and HIP data. Since the prior distributions
of the probability vectors $\mathbf{p}_{SO}$, $\mathbf{p}_{HR}$, and
$\mathbf{p}_{HIP}$ are independent, it follows that these probability vectors
also have independent posterior distributions. We summarize standard results
about the posterior of the vector of strikeout probabilities
$\mathbf{p}_{SO}$ with the understanding that similar results follow for the
other two vectors.

The posterior distribution of the vector $\mathbf{p}_{SO}$ can be represented
by the product

\[ g(\mathbf{p}_{SO} \mid \text{data}) = g(\mathbf{p}_{SO} \mid K_{SO}, \eta_{SO}, \text{data}) \times g(K_{SO}, \eta_{SO} \mid \text{data}), \]

where $g(K_{SO}, \eta_{SO} \mid \text{data})$ is the posterior distribution of
the parameters of the random effects distribution, and
$g(\mathbf{p}_{SO} \mid K_{SO}, \eta_{SO}, \text{data})$ is the posterior of
the probabilities conditional on the random effects distribution parameters.
We discuss each distribution in turn.


Random effects distribution

The posterior $g(\eta_{SO}, K_{SO} \mid \text{data})$ of the random effects
parameters describes the “talent curve” of the players with respect to
strikeouts, and the posterior mode $(\hat\eta_{SO}, \hat K_{SO})$ of this
distribution tells us about the center and spread of the talent curve.
In particular, $\hat\eta_{SO}$ represents the average strikeout rate among the
players, and the estimated standard deviation

\[ SD(p) = \sqrt{\frac{\hat\eta_{SO}(1 - \hat\eta_{SO})}{\hat K_{SO} + 1}} \]

measures the spread of this talent curve.
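
One way such a posterior mode could be located numerically is sketched below.
This is an illustration under stated assumptions, not the fitting procedure
used for the results in this paper: it integrates the player probabilities
out to obtain a beta-binomial marginal likelihood, assumes the vague prior is
proportional to $1/(\eta(1-\eta)(1+K)^2)$, searches a coarse grid over
$(\eta, K)$ directly, and runs on simulated rather than real data. The names
`log_posterior` and `posterior_mode` are hypothetical.

```python
import math
import random


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_posterior(eta, K, data):
    """Log posterior of (eta, K), up to an additive constant.

    data is a list of (y, n) pairs.  The likelihood is the beta-binomial
    marginal; the prior 1 / (eta (1 - eta) (1 + K)^2) is an assumption.
    """
    a, b = K * eta, K * (1 - eta)
    loglik = (sum(log_beta(a + y, b + n - y) for y, n in data)
              - len(data) * log_beta(a, b))
    logprior = -math.log(eta) - math.log(1 - eta) - 2 * math.log(1 + K)
    return loglik + logprior


def posterior_mode(data, eta_grid, K_grid):
    """Grid search for the (eta, K) pair with the highest log posterior."""
    return max(((eta, K) for eta in eta_grid for K in K_grid),
               key=lambda ek: log_posterior(ek[0], ek[1], data))


# Simulated (not real) data: 100 players with 400 at-bats each.
random.seed(1)
true_eta, true_K = 0.20, 40.0
data = []
for _ in range(100):
    p = random.betavariate(true_K * true_eta, true_K * (1 - true_eta))
    y = sum(random.random() < p for _ in range(400))
    data.append((y, 400))

eta_grid = [0.10 + 0.005 * i for i in range(41)]                # 0.10 to 0.30
K_grid = [math.exp(math.log(5) + 0.1 * i) for i in range(47)]   # 5 to about 500
eta_hat, K_hat = posterior_mode(data, eta_grid, K_grid)
sd_hat = math.sqrt(eta_hat * (1 - eta_hat) / (K_hat + 1))
print(eta_hat, K_hat, sd_hat)
```

A grid search is used only to keep the sketch dependency-free; a numerical
optimizer on a transformed scale such as (logit $\eta$, log $K$) would be a
natural alternative.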
Probability estimates

Given values of the parameters $\eta_{SO}$ and $K_{SO}$, the individual
strikeout probabilities $p^1_{SO}, ..., p^N_{SO}$ have independent beta
distributions where $p^j_{SO}$ is beta with shape parameters
$y^j_{SO} + K_{SO}\eta_{SO}$ and $n^j - y^j_{SO} + K_{SO}(1 - \eta_{SO})$,
where $y^j_{SO}$ and $n^j$ represent the number of strikeouts and at-bats for
the jth player.

Using this representation, the posterior mean of the strikeout probability
for the jth player is given by

\[ E(p^j_{SO} \mid \text{data}, \eta_{SO}, K_{SO}) = \frac{y^j_{SO} + K_{SO}\eta_{SO}}{n^j + K_{SO}}. \]

Plugging in the posterior estimates for $\eta_{SO}$ and $K_{SO}$, we get the
posterior estimate of the probability of the jth player striking out:

\[ \hat p^j_{SO} = \frac{y^j_{SO} + \hat K_{SO}\hat\eta_{SO}}{n^j + \hat K_{SO}}. \]
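
The posterior mean can also be read as a weighted average of the observed
rate $y/n$ and the talent mean $\eta$, with weight $n/(n + K)$ on the
observed rate. The short sketch below illustrates this equivalence; the
function names and the numbers are hypothetical.

```python
def posterior_mean(y, n, eta, K):
    """Posterior mean (y + K * eta) / (n + K) of a player's probability."""
    return (y + K * eta) / (n + K)


def shrinkage_weight(n, K):
    """Weight on the observed rate y / n; the rest goes to the mean eta."""
    return n / (n + K)


# Hypothetical player: 30 strikeouts in 150 at-bats, with eta = 0.25, K = 50.
y, n, eta, K = 30, 150, 0.25, 50.0
w = shrinkage_weight(n, K)
blend = w * (y / n) + (1 - w) * eta
print(posterior_mean(y, n, eta, K), blend, w)
```

A small $K$ relative to $n$ leaves the observed rate nearly unchanged, while
a large $K$ pulls the estimate strongly towards $\eta$.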

The same methodology was used to estimate the home run probabilities
and the hit-in-play probabilities for all players; denote the three sets of
estimates as $\{\hat p^j_{SO}\}$, $\{\hat p^j_{HR}\}$, and
$\{\hat p^j_{HIP}\}$, respectively. One can use these estimates to estimate
the hitting probabilities using the representation

\[ p_H = (1 - p_{SO}) \times (p_{HR} + (1 - p_{HR}) \times p_{HIP}). \]

For an individual player, by substituting the “component” estimates
$\hat p^j_{SO}$, $\hat p^j_{HR}$, and $\hat p^j_{HIP}$ into this expression,
we obtain a “component” estimate of $p^j_H$ that we denote by $\hat p^j_H$.


5.3 Example: 2011 Season 

To illustrate the use of these exchangeable models, we collect hitting data
for all players with at least 100 at-bats in the 2011 season. (We used 100
AB as a minimum number of at-bats to exclude pitchers from the sample.)
Three exchangeable models are fit: one to the collection of strikeout rates
$\{y^j_{SO}/n^j\}$, one to the collection of home run rates
$\{y^j_{HR}/(n^j - y^j_{SO})\}$, and one to the collection of in-play hit
rates $\{y^j_{HIP}/(n^j - y^j_{SO} - y^j_{HR})\}$.

Table 1 displays values of the random effect parameters $\eta$ and $K$ for
each of the three fits. The average strikeout rate is about 20%, the average
home run rate (among AB removing strikeouts) is 3.7%, and the average in-play
hit rate is 30%. The estimated values of $K$ are informative about the spreads
of the associated probabilities. The relatively small estimated value of $K$
for SO reflects a high standard deviation, indicating a large spread in the
player strikeout probabilities. The estimated value of $K$ for home runs is
also relatively small, indicating a large spread in home run probabilities
relative to their small mean. In contrast, the estimated value of $K$ for
hits in play is large, indicating that players’ abilities to get in-play hits
are more similar.

Table 1: Estimates of random effect parameters for strikeout data, home run 
data, and in-play hit data from the 2011 season. 



          SO        HR        HIP
  η       0.203     0.0369    0.303
  K       40.60     65.70     418.10
  SD      0.062     0.023     0.022

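
The SD row of Table 1 can be reproduced from the other two rows with the
talent-curve formula $\sqrt{\eta(1 - \eta)/(K + 1)}$. A minimal check follows;
the dictionary layout and function name are illustrative choices.

```python
import math

# (eta, K) estimates as reported in Table 1.
table1 = {"SO": (0.203, 40.60), "HR": (0.0369, 65.70), "HIP": (0.303, 418.10)}


def talent_sd(eta, K):
    """Standard deviation of a beta talent curve with mean eta, precision K."""
    return math.sqrt(eta * (1 - eta) / (K + 1))


for name, (eta, K) in table1.items():
    print(name, round(talent_sd(eta, K), 3))
# SO 0.062
# HR 0.023
# HIP 0.022
```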


These estimates of $K$ and $\eta$ can be used to compute “improved” estimates
of player strikeout probabilities, home run probabilities, and in-play hit
probabilities, and these component estimates can be used to obtain estimates
of player hit probabilities.
To illustrate these calculations, consider Carlos Beltran with the hitting
statistics displayed in Table 2.

Table 2: Batting data for Carlos Beltran for the 2011 season. 


          AB      SO      HR      H
  Count   520     88      22      156



His three observed rates are SO.Rate = 88/520 = 0.169, HR.Rate =
22/(520 − 88) = 0.051, and BABIP = (156 − 22)/(520 − 88 − 22) = 0.327.
From the fitted model, these observed rates are shrunk or adjusted towards
average values using the formulas:

\[ \hat p_{SO} = \frac{88 + 40.60 \times 0.203}{520 + 40.60} = 0.172, \]

\[ \hat p_{HR} = \frac{22 + 65.70 \times 0.0369}{520 - 88 + 65.70} = 0.049, \]

\[ \hat p_{HIP} = \frac{156 - 22 + 418.10 \times 0.303}{520 - 88 - 22 + 418.10} = 0.315. \]



(Note that Beltran’s strikeout and home run rates are only slightly adjusted
towards the average values due to the small estimated values of $K$. In
contrast, the consequence of the large estimated $K$ is that Beltran’s
in-play hit rate is adjusted about half of the way towards the average value.)
Using these estimates, Beltran’s hit probability is estimated to be

\[ \hat p_H = (1 - 0.172) \times (0.049 + (1 - 0.049) \times 0.315) = 0.289, \]

which is smaller than his observed batting average of 156/520 = 0.300. Much
of this adjustment in his batting average estimate is due to the adjustment
in Beltran’s in-play hitting rate.
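
The Beltran calculation can be reproduced in a few lines. The sketch below
uses the counts from Table 2 and the estimates from Table 1; the helper name
`shrink` is a hypothetical choice.

```python
def shrink(y, n, eta, K):
    """Shrinkage estimate (y + K * eta) / (n + K) of a probability."""
    return (y + K * eta) / (n + K)


# Carlos Beltran, 2011 season (Table 2).
ab, so, hr, h = 520, 88, 22, 156

# Random effects estimates (eta, K) from Table 1.
p_so = shrink(so, ab, 0.203, 40.60)
p_hr = shrink(hr, ab - so, 0.0369, 65.70)
p_hip = shrink(h - hr, ab - so - hr, 0.303, 418.10)

# Component estimate of the hit probability.
p_h = (1 - p_so) * (p_hr + (1 - p_hr) * p_hip)
print(round(p_so, 3), round(p_hr, 3), round(p_hip, 3), round(p_h, 3))
# 0.172 0.049 0.315 0.289
```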
6 Evaluation


6.1 A Single Exchangeable Model 


If one is primarily interested in estimating the hitting probabilities, there
is a well-known simpler alternative approach based on an exchangeable model
placed on the probabilities. If the hitting probability of the jth player is
given by $p^j_H$, then one can assume that $p^1_H, ..., p^N_H$ are distributed
from a common beta curve with parameters $K$ and $\eta$, and assign
$(K, \eta)$ a vague prior. Then the posterior mean of $p^j_H$ is approximated
by

\[ \tilde p^j_H = \frac{y^j_H + \hat K \hat\eta}{n^j + \hat K}, \]

where the jth player is observed to have $y^j_H$ hits in $n^j$ at-bats and
$\hat K$ and $\hat\eta$ are estimates from this exchangeable model.


6.2 A Prediction Contest 

The following prediction contest is used to compare the proposed component
estimates $\{\hat p^j_H\}$ with the batting average estimates
$\{\tilde p^j_H\}$ from the single exchangeable model.

• First we collect hitting data for all players with at least 100 at-bats in
both the 2011 and 2012 seasons.

• Fit the component model on the hitting data for the 2011 season: get
estimates of the strikeout, home run, and in-play hit probabilities for
all hitters and use these three sets of estimates to get the component
estimates $\{\hat p^j_H\}$.

• Use the hit/at-bat data for all players in the 2011 season and the single
exchangeable model to compute the estimates $\{\tilde p^j_H\}$.

• Use both the component estimates and the single exchangeable estimates
to predict the batting averages of the players in the 2012 season.
Let $w_j$ and $m_j$ denote the number of hits and at-bats of the jth player
in the 2012 season. Compute the root sum of squared prediction errors
for both methods:

\[ S_C = \sqrt{\sum_j \left( \hat p^j_H - \frac{w_j}{m_j} \right)^2}, \qquad S_1 = \sqrt{\sum_j \left( \tilde p^j_H - \frac{w_j}{m_j} \right)^2}. \]

The improvement in using the component estimates is $I = S_1 - S_C$. A
positive value of $I$ indicates that the component estimates are providing
closer predictions than the single exchangeable estimates.
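
The scoring step of the contest can be sketched as follows. The estimates and
next-season counts below are invented for illustration and are not results
from the paper; the function name is hypothetical.

```python
import math


def root_sum_sq_error(estimates, hits_next, ab_next):
    """Root sum of squared errors between estimates and next-season averages."""
    return math.sqrt(sum((p - w / m) ** 2
                         for p, w, m in zip(estimates, hits_next, ab_next)))


# Hypothetical estimates for three players from the first season.
component = [0.280, 0.255, 0.300]   # component estimates
single = [0.285, 0.250, 0.310]      # single exchangeable estimates

# Hypothetical hits and at-bats for the same players in the next season.
hits_next = [140, 120, 150]
ab_next = [500, 480, 510]

s_c = root_sum_sq_error(component, hits_next, ab_next)
s_1 = root_sum_sq_error(single, hits_next, ab_next)
improvement = s_1 - s_c
print(s_c, s_1, improvement)
```

With these made-up inputs the improvement is positive, which under the
convention above would favor the component estimates.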


This prediction contest was repeated for each of the seasons 1963 through 
2012. Batting data for all players in seasons y and y + 1 were collected, y 
= 1963, ..., 2012. The component and single exchangeable models were 
each fit to the data in season y and the improvement in using the component 
method over the single exchangeable model in predicting the batting averages 
in season y + 1 was computed. Figure 2 graphs the prediction improvement 
as a function of the season. It is interesting that the component estimates 
were not uniformly superior to the estimates from a single exchangeable 
model. However, the component method appears generally to be superior to
the standard method, especially for seasons 1963-1980 and 1995-2012.

Figure 2: Improvement in error in predicting batting averages by using the
component method for each of the seasons 1963 through 2012.

7 On-Base Percentages 

7.1 Decomposition 

We have focused on the decomposition of an at-bat. In a similar manner,
one can decompose a plate appearance as displayed in Figure 3. If we ignore
sacrifice hits (both SH and SF), then one can express an on-base percentage
as

\[ OBP \approx \frac{H + BB + HBP}{AB + BB + HBP}. \]

If one combines walks and hit by pitches and defines the “Walk Rate”

\[ \mathrm{Walk.Rate} = \frac{BB + HBP}{AB + BB + HBP}, \]

then one can write

\[ OBP \approx \mathrm{Walk.Rate} + (1 - \mathrm{Walk.Rate}) \times BA, \]

where BA = H/AB is the batting average.



Figure 3: Breakdown of a plate appearance. 

This representation makes it clear that an OBP is basically a function of
a hitter’s ability to draw walks, as measured by the walk rate, and his
batting average. Also, following the logic of the previous section, this
representation suggests that one may accurately estimate a player’s on-base
probability by combining separate accurate estimates of his walk probability
and his hitting probability.

7.2 Estimating On-Base Percentages 

In this setting, one can simultaneously estimate on-base percentages of a
group of players by separately estimating their walk probabilities and their
hit probabilities. One represents the probability that a player gets on-base,
$p_{OB}$, as

\[ p_{OB} = p_{BB} + (1 - p_{BB}) \times p_H. \]

This suggests a method of estimating a collection of on-base probabilities.

1. Estimate the walk probabilities $\{p^j_{BB}\}$ by use of an exchangeable
model.

2. Estimate the hitting probabilities $\{p^j_H\}$ by use of an exchangeable
model.

3. Estimate the on-base probabilities by use of the formula

\[ \hat p^j_{OB} = \hat p^j_{BB} + (1 - \hat p^j_{BB}) \times \hat p^j_H, \]

where $\hat p^j_{BB}$ and $\hat p^j_H$ are estimates of the walk probability
and the hit probability for the jth player.
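
The combination step can be sketched as below, assuming the representation
OBP ≈ Walk.Rate + (1 − Walk.Rate) × BA given in Section 7.1 and ignoring
sacrifices. The function name and the counts are hypothetical.

```python
def on_base_probability(p_bb, p_h):
    """Combine a walk probability and a hit probability into on-base."""
    return p_bb + (1 - p_bb) * p_h


# Hypothetical season counts used to check the identity on observed rates.
ab, h, bb, hbp = 500, 140, 60, 5
walk_rate = (bb + hbp) / (ab + bb + hbp)
ba = h / ab
obp_direct = (h + bb + hbp) / (ab + bb + hbp)
obp_combined = on_base_probability(walk_rate, ba)
print(obp_direct, obp_combined)
```

In the estimation setting, the observed rates passed to the function would be
replaced by the shrunken estimates of the walk and hit probabilities.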

Figure 4 demonstrates the value of this method in providing better
predictions. As in the “prediction contest” of Section 6.2, we are interested
in predicting the on-base probabilities for one season given hitting data from
the previous season. Two prediction methods are compared: the “single
exchangeable” method fits one exchangeable model using the on-base fractions,
and the “component” method separately estimates the walk rates and hitting
rates for the players. One evaluates the goodness of predictions by the
square root of the sum of squared prediction errors and one computes the
improvement in using the component procedure over the single exchangeable
method. These methods are compared for 50 prediction contests using data
from each of the seasons 1963 through 2012 to predict the on-base proportions
for the following season. As most of the points fall above the horizontal
line at zero, this demonstrates that the component method generally is an
improvement over the single exchangeable method.



Figure 4: Improvement in error in predicting on-base percentages by using 
the component method for each of the seasons 1963 through 2012. 
8 Historical Perspective of Hitting Rates


To obtain a historical perspective of the change in hitting rates, the basic
exchangeable model was fit to rates for all batters with at least 100 AB
for each of the seasons 1960 through 2012. For each season, we estimate
the mean talent $\hat\eta$ and associated precision parameter $\hat K$; the
associated estimated posterior standard deviation of the talent distribution
is

\[ SD(p) \approx \sqrt{\frac{\hat\eta(1 - \hat\eta)}{\hat K + 1}}. \]



Figure 5 displays the pattern of mean strikeout rates for all batters with 
at least 100 AB. Note that the average strikeout rate among batters initially 
showed a decrease from 1970 through 1980 but has steadily increased until 
the current season. If we performed fits of the exchangeable model for all 
pitchers for each season from 1960 through 2012, one would see a similar 
pattern in the mean strikeout rates. 

Figure 6 displays the estimated standard deviations of the strikeout
abilities of all batters with at least 100 AB across seasons and overlays the
estimated season standard deviations of the strikeout abilities of all
pitchers. First, note that among batters, the spread of the strikeout
abilities shows a similar pattern to the mean strikeout rate: there is a
decrease from 1970 to 1980 followed by a steady increase to the current day.
The spread of strikeout abilities among pitchers shows a different pattern.
The standard deviations for pitchers have steadily increased over seasons, and
the spread in the talent distribution for pitchers is significantly smaller
than the spread of the talents for batters.
[Plot for Figure 5: Mean of Strikeout Rates (SO / AB) against Season.]


Figure 5: Plot of mean strikeout rates for batters (at least 100 AB) for 
seasons 1960 through 2012. 

9 Career Trajectories 

9.1 Predictive Residuals 

One way of measuring the effectiveness of a batter or a pitcher is to look at
the vector of rates (BB.Rate, SO.Rate, HR.Rate, BABIP) for a particular
season. Plotting these rates over a player’s career, one gains a general
understanding of the strengths of the batter or pitcher and learns when these
players achieved peak performances. Albert (2002) demonstrates the value
of looking at career trajectories to better understand the growth and
deterioration of players’ batting abilities.

[Plot for Figure 6: Standard Deviations of SO Rates against Season, by Type
(Batter, Pitcher).]


Figure 6: Plot of standard deviation of strikeout rates for batters (at least 
100 AB) and for pitchers for seasons 1960 through 2012. 

The four observed rates have different averages and spreads, and as we see
from Figures 5 and 6, the averages and spreads can change dramatically over
different seasons. We use residuals from the predictive distribution to
standardize these rates. Let $y$ denote the number of successes in $n$
opportunities for a player in a particular season and suppose the underlying
probabilities of the players follow a beta curve with mean $\eta$ and
precision $K$. The predictive density of the rate $y/n$ is then
beta-binomial, with mean $\eta$ and standard deviation

\[ SD(y/n) = \sqrt{\frac{\eta(1 - \eta)}{K + 1} \left( 1 + \frac{K}{n} \right)}. \]

When the exchangeable model is fit, one obtains estimates of the random
effects parameters $\hat\eta$ and $\hat K$, and obtains an estimate of the
standard deviation $SD(y/n)$. Define the standardized residual

\[ z = \frac{y/n - \hat\eta}{SD(y/n)}. \]

In the following, plots of the standardized residuals of the walk/hit-by-pitch
rates, strikeout rates, home run rates, and hit-in-play rates are displayed
to show special strengths of hitters and pitchers.
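
A standardized residual of this kind could be computed as sketched below. The
standard deviation used is the exact beta-binomial result
$\sqrt{\eta(1-\eta)(n + K)/(n(K + 1))}$, which is assumed here to be the
intended predictive standard deviation; the function names and the inputs are
hypothetical.

```python
import math


def predictive_sd(n, eta, K):
    """SD of y / n when y is beta-binomial with mean eta and precision K."""
    return math.sqrt(eta * (1 - eta) / (K + 1) * (1 + K / n))


def standardized_residual(y, n, eta, K):
    """Observed rate minus the mean, in predictive standard deviations."""
    return (y / n - eta) / predictive_sd(n, eta, K)


# Hypothetical season: 150 strikeouts in 500 at-bats, with eta = 0.20, K = 40.
z = standardized_residual(150, 500, 0.20, 40.0)
print(round(z, 2))
```

As $n$ grows, the predictive standard deviation approaches the talent
standard deviation $\sqrt{\eta(1-\eta)/(K+1)}$, so the residual then measures
distance from the mean in units of talent spread.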

9.2 Batter Trajectories 

The graphs of the standardized rates are displayed for the careers of Mickey
Mantle in Figure 7 and Ichiro Suzuki in Figure 8. Looking at the four graphs
of Figure 7 in a clockwise manner from the upper-left, one sees

• Mantle drew many walks/HBP and his walk/HBP rate actually increased
during his career.

• Mantle had an above-average strikeout rate.

• His home run rate hit a peak during the middle of his career.

• His in-play hit rate decreased towards the end of his career.

In contrast, by looking at Figure 8, one sees that Suzuki had consistently low
walk/HBP, strikeout, and home run rates throughout his career. He was
especially good in his hit-in-play rate, although there was much variability
in these rates and they showed a decrease towards the end of his career.


[Plot for Figure 7: Mickey Mantle, four panels of rates against Season.]


Figure 7: Standardized residuals of the four rates for Mickey Mantle. 


[Plot for Figure 8: Ichiro Suzuki, four panels of rates against Season.]

Figure 8: Standardized residuals of the four rates for Ichiro Suzuki.


9.3 Pitcher Trajectories

These displays of standardized rates are also helpful for understanding the
strengths of pitchers in the history of baseball. Figures 9 and 10 display the
standardized rates for the Hall of Fame pitchers Greg Maddux and Steve
Carlton. Maddux was famous for his low walk rate and generally low ERA.
Looking at the trajectories of his rates in Figure 9, one sees that Maddux’s
best walk rates occurred during the last half of his career. His best
strikeout rate, home run rate, and HIP rate occurred about 1995 and all three
of these rates deteriorated from 1995 until his retirement in 2008. In
contrast, one sees from Figure 10 that Carlton had a slightly below average
walk rate and a high strikeout rate during his career. Since all of these
rates significantly deteriorated towards the end of his career, perhaps
Carlton should have retired a few years earlier. Based on these graphs,
Carlton’s peak season in terms of performance was about 1980, the season when
the Phillies won the World Series.

[Plot for Figure 9: Greg Maddux, panels including BB Rate and SO Rate against
Season.]



Figure 9: Standardized residuals of the four rates for Greg Maddux. 


10 FIP Measures 


10.1 Introduction 


Recently, there has been an increased emphasis on the use of
fielding-independent-performance (FIP) measures of pitchers. The idea is to
construct a measure based on the outcomes such as walks, hit-by-pitches,
strikeouts, and home runs that a pitcher directly controls. The usual
definition of FIP is given by

\[ FIP = \frac{13\,HR + 3\,(BB + HBP) - 2\,SO}{IP} + \text{constant}, \]

where HR, BB, HBP, and SO are the counts of these different events, IP
is the innings pitched, and constant is a constant defined to ensure that the
average FIP is approximately equal to the league ERA.

Figure 10: Standardized residuals of the four rates for Steve Carlton.

Although FIP is defined in terms of counts, it is straightforward to
write it as a function of the four rates SO.Rate, HR.Rate, BABIP, and
Walk.Rate. Let BFP denote the count of batters faced, then

\[ \begin{aligned}
HR &= BFP\,(1 - \mathrm{Walk.Rate})(1 - \mathrm{SO.Rate})\,\mathrm{HR.Rate}, \\
BB + HBP &= BFP \times \mathrm{Walk.Rate}, \\
SO &= BFP\,(1 - \mathrm{Walk.Rate})\,\mathrm{SO.Rate}, \\
IP &= \tfrac{1}{3}\,BFP\,(1 - \mathrm{Walk.Rate})\,[\mathrm{SO.Rate} + (1 - \mathrm{SO.Rate})(1 - \mathrm{HR.Rate})(1 - \mathrm{BABIP})].
\end{aligned} \]

Substituting these expressions into the formula and ignoring the constant
term, the FIP measure is expressed solely in terms of these four rates.
Although on face value the FIP measure seems to depend on the sample size
(the number of batters faced), the value of BFP cancels out in the
substitution.

10.2 Estimation of FIP Ability 

All of the observed rates are estimates of the underlying probabilities of those 
events. If we take the expression of FIP, ignoring the constant, and replace 
the rates with probabilities, we get an expression for a pitcher’s FIP ability 
denoted by $p_{FIP}$:

$$
p_{FIP} = \frac{39\,(1 - p_{BB})(1 - p_{SO})\,p_{HR} + 9\,p_{BB} - 6\,(1 - p_{BB})\,p_{SO}}
{(1 - p_{BB})\,\bigl(p_{SO} + (1 - p_{SO})(1 - p_{HR})(1 - p_{HIP})\bigr)}
$$

Using data for a single season, we can use separate exchangeable models 
to estimate the walk probabilities $\{p^j_{BB}\}$, the strikeout probabilities 
$\{p^j_{SO}\}$, the home run probabilities $\{p^j_{HR}\}$, and the hit-in-play 
probabilities $\{p^j_{HIP}\}$ for all pitchers. Substituting these probability 
estimates into the $p_{FIP}$ formula gives a new estimate of FIP ability for 
each pitcher in that season, as an alternative to the observed FIP measure. 
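
The substitution step can be sketched as follows. This is not the fitting procedure used here: as a simplified stand-in for the exchangeable models, each observed rate is shrunk towards the pooled rate with a fixed precision `K`, whereas the random effects parameters would actually be estimated from the data. The names `shrink_rates` and `fip_ability`, the value of `K`, the counts, and the choice of denominators are all assumptions made for illustration.

```python
def shrink_rates(successes, trials, K):
    """Posterior means under a beta prior with mean eta and precision K."""
    eta = sum(successes) / sum(trials)  # pooled rate used as the prior mean
    return [(y + K * eta) / (n + K) for y, n in zip(successes, trials)]


def fip_ability(p_bb, p_so, p_hr, p_hip):
    """FIP ability (constant ignored) from the four component probabilities."""
    numerator = (39 * (1 - p_bb) * (1 - p_so) * p_hr
                 + 9 * p_bb
                 - 6 * (1 - p_bb) * p_so)
    denominator = (1 - p_bb) * (p_so + (1 - p_so) * (1 - p_hr) * (1 - p_hip))
    return numerator / denominator


# Hypothetical counts for three pitchers: (BFP, walks incl. HBP, SO, HR, hits in play).
pitchers = [(800, 60, 150, 20, 170), (400, 45, 60, 14, 90), (650, 40, 160, 12, 125)]
bfp, bb, so, hr, hip = (list(col) for col in zip(*pitchers))
# Each rate has its own denominator, matching the sequential breakdown of outcomes.
n_so = [b - w for b, w in zip(bfp, bb)]
n_hr = [n - s for n, s in zip(n_so, so)]
n_hip = [n - h for n, h in zip(n_hr, hr)]
# K = 200 is an arbitrary illustrative value, reused for all four components.
p_bb, p_so, p_hr, p_hip = (shrink_rates(y, n, K=200) for y, n in
                           [(bb, bfp), (so, n_so), (hr, n_hr), (hip, n_hip)])
estimates = [fip_ability(*p) for p in zip(p_bb, p_so, p_hr, p_hip)]
```

Each entry of `estimates` is a component-based estimate of one pitcher's FIP ability, without the league constant.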

10.3 Performance 

Based on our earlier work, one would anticipate that our new estimates of 
FIP ability would be superior to the usual estimates in predicting the FIP 
values of the pitchers in the following season. As in our evaluation of the 
performance of the improved batting probabilities, the new estimates can be 
compared with exchangeable estimates based on the standard representation 
of the FIP statistic. 

For a given pitcher, suppose one collects the measurement 
$13\,HR + 3\,(BB + HBP) - 2\,SO$ for each inning pitched. If the pitcher 
pitches for $N = IP$ innings, then the measurements can be denoted by 
$Y_1, \ldots, Y_N$ and the FIP statistic is simply the sample mean 
$FIP = \bar{Y}$. It is reasonable to assume that $\bar{Y}$ is normal with 
mean $\mu_{FIP}$ and variance $\sigma^2/N$, where $\sigma$ reflects the 
variability of the values of $Y_j$ within innings. 

Based on this representation, one can estimate the FIP abilities 
$\{\mu_{FIP}\}$ by use of an exchangeable model where the abilities are 
assigned a normal curve with mean $\mu$ and standard deviation $\tau$, and a 
vague prior is assigned to $(\mu, \tau)$. By fitting this model, one shrinks 
the observed FIP values for the pitchers towards an average value. 
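
The shrinkage produced by this normal model can be sketched for fixed values of the standard deviations. In the sketch below, `sigma` and `tau` are supplied as inputs rather than estimated under a vague prior, and the common mean is replaced by a precision-weighted average; these simplifications, the function name `shrink_fip`, and the input values are assumptions made for illustration.

```python
def shrink_fip(fip, innings, sigma, tau):
    """Posterior means of FIP abilities for fixed sigma and tau."""
    sampling_var = [sigma ** 2 / n for n in innings]   # variance of each observed FIP
    weights = [1.0 / (v + tau ** 2) for v in sampling_var]
    # Precision-weighted mean, standing in for the estimated mean of the normal curve.
    mu = sum(w * y for w, y in zip(weights, fip)) / sum(weights)
    # Each observed value moves towards mu; fewer innings means more shrinkage.
    return [mu + tau ** 2 / (tau ** 2 + v) * (y - mu) for y, v in zip(fip, sampling_var)]


# Hypothetical inputs: two pitchers with equal innings are pulled halfway to the mean.
print(shrink_fip([2.0, 4.0], [100.0, 100.0], sigma=10.0, tau=1.0))  # [2.5, 3.5]
```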

Again a prediction experiment is used to predict the FIP values for all 
pitchers from a season given these measures from the previous season. The 
“standard” method predicts the FIP values using the single exchangeable 
model, and the “component” method first separately estimates the four sets 
of probabilities with exchangeable models, and then substitutes these 
estimates in the formula to obtain FIP predictions. As might be expected, the 
component method results in a smaller prediction error for practically all of 
the seasons of the study. This again demonstrates the value of this “divide 
and conquer” approach to obtain superior estimates of pitcher characteristics 
that are functions of the underlying probabilities. 
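
The comparison in such a prediction experiment can be sketched as below. The error criterion is not stated in this section, so root mean squared error is assumed here; the function names `prediction_error` and `compare_methods` are hypothetical.

```python
import math


def prediction_error(predicted, observed_next):
    """Root mean squared error between predictions and next-season values."""
    if len(predicted) != len(observed_next) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed_next)]
    return math.sqrt(sum(squared) / len(squared))


def compare_methods(standard, component, observed_next):
    """Return the prediction error of each method on the same pitchers."""
    return {
        "standard": prediction_error(standard, observed_next),
        "component": prediction_error(component, observed_next),
    }
```

In use, `compare_methods` would be called once for each pair of consecutive seasons, with both sets of predictions computed from the earlier season only.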





11 Concluding Comments 


In the sabermetrics literature, the regression effect is well known: to predict 
a batter’s hitting rate for a given season, one takes the batter’s hitting 
average from the previous season and moves this estimate towards an average. 
This paper extends this approach to estimating a batting measure that is a 
function of different rates. One applies the random effects model to get 
accurate estimates of the component rates for all players, and then 
substitutes these estimates into the function to get improved predictions of 
the batting measures. This approach was easy to apply for the batting 
probability and on-base probability situations due to the convenient 
factorization of the likelihood and the use of independent exchangeable prior 
distributions. 

A single beta random effects curve was chosen for convenience due to its 
attractive analytical features, but this “component” approach can be used 
for any choice of random effects model. For example, one may wish to use 
covariates in modeling the probabilities that hitters get a hit on balls put 
in play. If $p^j_{HIP}$ is the probability that the jth player gets a hit, 
then one could assume that $p^1_{HIP}, \ldots, p^N_{HIP}$ are independent from 
beta($\eta_1$, $K$), ..., beta($\eta_N$, $K$) distributions where the prior 
means satisfy the logistic model 

$$
\log\left(\frac{\eta_j}{1 - \eta_j}\right) = \beta_0 + \beta_1 x_j,
$$

where $x_j$ is a relevant predictor such as the speed of the ball off the bat. 
As before, the prior parameters $(\beta_0, \beta_1, K)$ would be assigned a 
weakly informative prior to complete the model. 
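
A minimal sketch of this covariate prior follows. It assumes that a beta distribution with prior mean eta and precision K has shape parameters K·eta and K·(1 − eta), and it writes the intercept and slope of the logistic model as `beta0` and `beta1`; the function name and all numerical values are hypothetical.

```python
import math


def beta_shape_parameters(x, beta0, beta1, K):
    """Shape parameters of a beta prior whose mean follows a logistic model."""
    eta = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))  # prior mean on the probability scale
    return K * eta, K * (1.0 - eta)


# Hypothetical values: a predictor of zero with a zero intercept gives a prior mean of 0.5.
print(beta_shape_parameters(x=0.0, beta0=0.0, beta1=0.02, K=100.0))  # (50.0, 50.0)
```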

The FIP measure was motivated by the basic observation that a team 
defense, not just a pitcher, prevents runs, and one wishes to devise alternative 
measures that isolate a pitcher’s effectiveness. In a similar fashion, the goal 
here is to isolate the different components of a hitter’s effectiveness. These 
component estimates are useful by themselves, but they are also helpful in 
estimating ensemble measures of ability such as the probability of getting on 
base. 